Building Webcorpora of Academic Prose with BootCaT
Abstract
This paper describes a procedure for gathering corpora of academic writing from the web using BootCaT. The procedure takes terms that are distinctive of different registers and disciplines in COCA and uses them to locate and collect web pages containing those terms.
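As a rough illustration of the term-driven step, the sketch below combines seed terms into random tuples and turns each tuple into a search query, which is the general way BootCaT-style bootstrapping turns a term list into web searches. The function name `build_queries`, its parameters, and the seed terms are all hypothetical; they are not taken from the paper, from COCA, or from the BootCaT toolkit itself.

```python
import itertools
import random


def build_queries(seed_terms, tuple_size=3, n_queries=5, rng_seed=0):
    """Combine seed terms into random tuples, one search query per tuple."""
    # Sort and de-duplicate so the result is reproducible for a given rng_seed.
    terms = sorted(set(seed_terms))
    all_tuples = list(itertools.combinations(terms, tuple_size))
    rng = random.Random(rng_seed)
    # Sample without replacement so no query is issued twice.
    chosen = rng.sample(all_tuples, min(n_queries, len(all_tuples)))
    # Quote each term so multi-word terms are searched as phrases.
    return [" ".join(f'"{term}"' for term in combo) for combo in chosen]


# Hypothetical seed terms, for illustration only.
seeds = ["hypothesis", "methodology", "empirical", "findings", "theoretical framework"]
queries = build_queries(seeds)
for query in queries:
    print(query)
```

Each resulting string would then be submitted to a search engine and the returned pages downloaded and cleaned; that retrieval step is omitted here because it depends on the search service used.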
Similar resources
Stephen Hawking's Community-Bound Voice A Functional Investigation of Self-Mentions in Stephen Hawking's Scientific Prose
Thanks to the development of the concept of metadiscourse, it is now widely acknowledged that academic/scientific writing is not only concerned with communicating purely propositional meanings: what is communicated through academic/scientific communication is seen to be intertwined with the negotiation of social and interpersonal meanings. While a large number of so-called metadiscoursal resour...
GrawlTCQ: Terminology and Corpora Building by Ranking Simultaneously Terms, Queries and Documents using Graph Random Walks
In this paper, we present GrawlTCQ, a new bootstrapping algorithm for building specialized terminology, corpora and queries, based on a graph model. We model links between documents, terms and queries, and use a random walk with restart algorithm to compute relevance propagation. We have evaluated GrawlTCQ on an AFP English corpus of 57,441 news over 10 categories. For corpora building, GrawlTC...
Avoiding Prolixity in Academic Prose; the Use of Quantity Metadiscourse in Research Articles
As part of a wider attempt to bestow the spirit of scholarly prose upon the research articles’ rhetorical structure, academic writers invariably take advantage of quantity metadiscourse markers to avoid prolixity and live up to the implicit and explicit maxims of quantity category as suggested in Gricean CP and similar models. In order to develop a clear understanding of quantity strategies di...
Plenty of Fish in the Academy: On Marshall McLuhan’s Prose as an Anti-Environment
The purpose of this synthesis is to deconstruct the medium of Marshall McLuhan’s prose as an anti-environment for the medium of traditional academic writing. By placing McLuhan’s own theory in dialogue with the founding principles of linguistic anthropology, I will argue that McLuhan’s authorial tactics—a subject of his long-term repudiation by the academic community on the whole—adhered to the...
Retrieving Japanese specialized terms and corpora from the World Wide Web
The BootCaT toolkit (Baroni and Bernardini, 2004) is a suite of Perl programs implementing a procedure to bootstrap specialized corpora and terms from the web using minimal knowledge sources. In this paper, we report ongoing work in which we apply the BootCaT procedure to a Japanese corpus and term extraction task in the hotel terminology domain. The results of our experiments are very encourag...